The workflow was developed using Command Line Terminal and UNIX resources in MacOS.
Downloading database date from UniProt. In the following lists, only Long Wavelength, Short Wavelength, and Ultraviolet were downloaded. Also, sequences length ranged from 201 to 400
https://www.uniprot.org/uniprotkb?facets=length%3A%5B201%20TO%20400%5D&query=opsin%20ephemeroptera
https://www.uniprot.org/uniprotkb?facets=length%3A%5B201%20TO%20400%5D&query=opsin%20ephemeroptera
Table 1. A list of all
the Uniprot sequences used.
gunzip if
necessary://input
gunzip uniprot-compressed.fasta.gz #uncompressing
ll uniprot* #this command to list all of the uniprot files.
grep -c '>' uniprot* #this will give us a glimpse for how many protein sequences we have in both files
//output
uniprot_eph.fasta:12
uniprot_odo.fasta:8
//input
cat *.fasta >> uniprot_opsin_db.fasta #concatenating / merging both files / naming the new file uniprot_opsin_db.fasta
#after this, we check...
grep -c '>' uniprot_opsin_db.fasta
#the output should be 20, so the merge was successful.
#now we do the actual moving of the database files to our blastdb directory created in the root directory.
mv uniprot_opsin_db.fasta $HOME/blastdb
#this contain the list of proteins of interest (the genes that code for opsins. )
All sequences consolidated in one file.
NOTE: MAFFT v7.490 (Katoh et al, 2013) must be downloaded and installed before this steps. MAFFT Download instructions here.
#Command Parameters:
#[the location of the mafft.bat ] [our database file ] > [the neame of the new file to be aligned.msa]
../mafft-mac/mafft.bat uniprot_opsin_db.fasta > aligned_uniprot_opsin_db.msa
#then glimpse at the file created
nano aligned_uniprot_opsin_db.msa
Aligned sequences after
the alignement was performed .
NOTE: HMMER v3.3.2 (EddyRivasLab, 2018) must be downloaded and installed before this steps. HMMER Download instructions here.
#Command Parameters: [hmmrbuild] [name of the new file to be created .hmm] [the name of the aligned db file .msa]
hmmbuild hmmb_aligned_uniprot_opsin_db.hmm aligned_uniprot_opsin_db.msa
#Inspection of the hmm file was done by using the following commands
ls -lh
head *.hmm
cat *.hmm
Sequences after
the profile was built
NOTE: FastQC Use instructions here. Trimmomatic v.0.27 (Bolger et al, 2014) must be downloaded and installed before these steps. Trimmomatic Download instructions here.
#make a trimmomatic file sbatch .sh
#The following are the contents of trim.sh
$nano trim.sh
#! /bin/bash
java -jar /opt/Trimomatic-0.3/trimmomatic \
PE \
-phread33 \
-threads 6 \
SRRFILE_1.fastq.gz\ #feed in forward
SRRFILE_2.fastq.gz\ #feed in reverse file
clean_1.fastq.gz\ #output of the clean forward reads
unpaired_1.fastq.gz\ #output of the unpair forward reads
clean_2.fastq.gz\ #output of the clean reverse reads
unpaired_2.fastq.gz\ #output of the unpair reverse reads
ILLUMINACLIP:/opt/Trimmomatic-0.38/adapters/allAdapter.fas:2:30:10 \ #READ our set of sequences in the adapters folder, and it will determine the quality of how close we want these adapter to find our sequences.
LEADING: 30\ #If the quality is higher than 30, it will be kept.
TRAILING: 30\
HEADCROP: 5:30\ #removed the first 5 bases
SLIDINGWINDOW: 5:30 \
MINLEN:50\ #Minimum lenght
$CTRL + O
$CTRL + X
#at the end of this, save this where your files SRRFILE_1.fastq.gz\ and SSFILE_2.fastq.gz\ are located
#you may now analyse these files by using fastqc, again.
NOTE: Trinity v2.15.1 (Grabherr et al, 2011) must be downloaded and installed before these steps. Trinity Download instructions here.
# Run trinity via: $TRINITY_HOME/Trinity
# At this point Trinity should be installed.
# NOW, To Launch Trinity, simply do the command:
$Trinity
#Using Trinity (prl script)
#go to the desired location where your clean forward and reverse files are: clean_1.fastq.gz and clean_2.fastq.gz
#Now let's run the trinity assembly (this will take several hours)
$ .../opt/trinity/Trinity/ --seqType fq --max_memory 50G --left clean_1.fq --right clean_2.fq --CPU 6 --SS_lib_type RF
####At the end of trinity, the output needed to proceed will be the file
trinity.fasta
A glimpse of the Trinity
output,
Trinity.fasta file.
NOTE: TransDecoder v5.7.0 (Haas, n.d.) must be downloaded and installed before these steps. Transdecoder Download instructions here.
# Extract the base frequencies with transdecoder on the trinity ouput :
$TransDecoder.Longorfs -t ../data/fasta_files/EP017.fasta
//output
#-first extracting base frequencies, we'll need them later.
#- extracting ORFs from transcripts.
#-total transcripts to examine: 93298
#[93200/ 93298] = 99.89% done CMD: touch /Users/ernievilela/Documents/Classes/Spring_2023/Senior_Thesis/TransDecoder-TransDecoder-v5.7.0/EP017.fasta.transdecoder_dir.__checkpoints_longorfs/TD.longorfs.ok
#################################
### Done preparing long ORFs. ###
##################################
# Use file: /Users/ernievilela/Documents/Classes/Spring_2023/Senior_Thesis/TransDecoder-TransDecoder-v5.7.0/EP017.fasta.transdecoder_dir/longest_orfs.pep for Pfam and/or BlastP searches to enable homology-based coding region identification.
# Then, run TransDecoder.Predict for your final coding region predictions.
#Once this is done running, run TransDecoder.Predict
$TransDecoder.Predict -t ../data/fasta_files/EP017.fasta
#Making cds file: EP017.fasta.transdecoder.cds
#transdecoder is finished. See output files EP017.fasta.transdecoder.*
A glimpse of the file
containing the peptide sequences from the TransDecoder ouput,
longest_orfs.pep file.
##At this point, make sure you copy your database (sequences to identify) to the hmmer folder.
#Searching using HMMER
#Command Parameters: [hmmsearch ] [ dbfile.hmm] [fullgenomefile.fasta]
hmmsearch -A EP017_hits.txt hmmb_aligned_uniprot_opsin_db.hmm /Users/ernievilela/Documents/Classes/Spring_2023/Senior_Thesis/TransDecoder-TransDecoder-v5.7.0/transdecoder_output/EP017_transdecoder_output/EP017.fasta.transdecoder_dir/longest_orfs.pep
#Now we reformat from .txt to fasta to get all the hits
esl-reformat fasta EP017_hits.txt > EP017_hits.fasta
#this is the end! We found our hits!!!
#Now we look for those matches that are labeled complete
$grep -c 'complete' EP017_hits.fasta
HMMER was able to produce a list of matches or hits of opsins for each transcriptome.
A glimpse of the file containing all the database hits and matches to
opsin genes, opsin_hits.fasta file.
Almudi, I., Vizueta, J., Wyatt, C.D.R. et al. Genomic adaptations to aquatic and aerial life in mayflies and the origin of insect wings. Nat Commun 11, 2631 (2020). https://doi.org/10.1038/s41467-020-16284-8
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID: 24695404; PMCID: PMC4103590.
EddyRivasLab. (2018). EddyRivasLab/hmmer. Retrieved from https://github.com/EddyRivasLab/hmmer
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011 May 15;29(7):644-52. doi: 10.1038/nbt.1883. PubMed PMID: 21572440.
Guignard Q, Allison JD, Slippers B. The evolution of insect visual opsin genes with specific consideration of the influence of ocelli and life history traits. BMC Ecol Evol. 2022 Jan 7;22(1):2. doi: 10.1186/s12862-022-01960-8. PMID: 34996358; PMCID: PMC8739693.
Hölzer, M., & Marz, M. (2019). De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013 Apr;30(4):772-80. doi: 10.1093/molbev/mst010. Epub 2013 Jan 16. PMID: 23329690; PMCID: PMC3603318.
Lucidchart. (2023) Created in Lucidchart, www.lucidchart.com
Ogden, T.H., Gattolliat, J.L., Sartori, M., Staniczek, A.H., Soldán, T. and Whiting, M.F. (2009), Towards a new paradigm in mayfly phylogeny (Ephemeroptera): combined analysis of morphological and molecular data. Systematic Entomology, 34: 616-634. https://doi.org/10.1111/j.1365-3113.2009.00488.xPennisi, E. (2020, June 2). Insect Wings evolved from legs, Mayfly genome suggests. Science. Retrieved April 20, 2023, from https://www.science.org/content/article/insect-wings-evolved-legs-mayfly-genome-sugge sts
Regev A. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011 May 15;29(7):644-52.
Suvorov, Anton et al. (2016), Opsins have evolved under the permanent heterozygote model: insights from phylotranscriptomics of Odonata, Molecular Ecology, Article-journal, https://doi.org/10.1111/mec.13884
Szaz D, Horvath G, Barta A, Robertson BA, Farkas A, Egri A, Tarjanyi N, Racz G, Kriska G. Lamp-lit bridges as dual light-traps for the night-swarming mayfly, Ephoron virgo: interaction of polarized and unpolarized light pollution. PLoS One. 2015 Mar 27;10(3):e0121194. doi: 10.1371/journal.pone.0121194. PMID: 25815748; PMCID: PMC4376897.
The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D523–D531, https://doi.org/10.1093/nar/gkac1052
Haas, BJ (n.d.). TransDecoder-TransDecoder-v5.7.0. Retrieved from https://github.com/TransDecoder/TransDecoder/wiki.
National Human Genome Research Institute(NHGRI) . Transcriptome. (2015). In Transcriptome. https://link.gale.com/apps/doc/A418467482/AONE?u=utahvalley&sid=ebsco&xid=35791e01
Ernie Vilela - 2023